Dramatic Improvements to Feature Based Stereo 


V.N, Smelvansky, R.D. Morris, F.O. Kuehnel, D.A. Maluf, and P. Cheeseman 

NASA Ames Research. Center, MS 269-2, Moffett Field, CA 94035, USA 
[vadim, r dm, knehnel .maluf , cheesem] Qemail . axc.nasa.gov 


Abstract. The camera registration extracted from feature based stereo 
is usually considered sufficient to accurately localize the 3D points. How- 
ever, for natural scenes the feature localization is not as precise as in 
man-made environments. This results in small camera registration er- 
rors. We show that even very small registration errors result in large 
errors in dense surface reconstruction. 

We describe a method for registering entire images to the inaccurate 
surface model. This gives small, but crucially important improvements 
to the camera parameters. The new registration gives dramatically better 
dense surface reconstruction. 


1 Introduction 

The goal of surface recovery is to take a set of images and estimate the positions 
and orientations of the cameras that produced the images, and a representation 
of the surface that was imaged. This is an example of an inverse problem . The 
forward (or direct) problem is: given a surface and the position and orientation 
of a camera, what is the expected image? This is the area of computer graphics 
knowm as rendering [1]. The inverse problem is: given a set of images, estimate the 
position and orientation of the cameras, and the shape and reflectance properties 
of the surface. That is, estimate a generative model [2,3]. 

The conventional feature based approach to 3D surface reconstruction takes 
a sparse set of corresponding feature points from which the positions and ori- 
entations of the cameras are estimated. The quality of the camera calibration 
crucially depends on well localized features. Feature tracking in a sequence of 
images with small frame to frame disparity has been demonstrated successfully. 
The two main concerns are the robustness and the accuracy of such an approach. 
Robustness is usually improved by tracking across a sequence with small inter- 
frame displacements , but for many applications this .cannot be assured. A fu rther 
concern is that the overall accuracy of the reconstructed 3D model from a sparse 
point could is rather doubtful and prior knowledge is not easily incorporated in 
the conventional reconstruction scheme. 

We show that a robust and accurate reconstruction scheme that can incor- 
porate any prior knowledge can be implemented by applying Bayesian inference 
of the underlying model space. We postulate models for the surface and for the 
imaging process, and Bayes theorem tell us how to estimate the parameters of 
these models from the image data. We show that this approach allows us to make 



small but crucially important improvements to the camera parameters estimated 
from point matching. These improvements result in a dramatic improvement in 
the accuracy of the 3D surface model. 

In this paper we restrict our reconstruction to simple surface models (no 
occluding parts), therefore we use a simple triangulated mesh model for the 
geometry of the surface, storing heights, z, at each vertex of the mesh. We also 
associate a parameterized reflectance model with the surface. For simplicity here 
we consider the Lambertian model, and store a single albedo value, p at each 
vertex. (For multispectral data we store an array of albedo values, one for each 
spectral band.) 

We use the standard pinhole camera model for the image formation process 
[9], and assume that the internal camera parameters are known. (See, for ex- 
ample, [10] for a simple method of internal camera calibration.) The theoretical 
development of our approach can be generalized to other imaging geometries 
and surface reflectance models. 

The closest work to that described here is in [4,3]. That work also used a 
triangulated mesh as the surface representation. The cost function they used is 
based on minimising the variance of the grey levels of the vertices’ projection into 
the images, rather than the direct image error that is used here. The approach in 
[4, 3] is thus restricted to triangulated meshes that are coarse when projected into 
the images. The approach described here places no restrictions on the density of 
the mesh, which may be super-resolved [5]. The system in [4, 3] is also restricted 
to cases wher the lighting was from the same direction in all images. Here we 
require only that the lighting direction is known. 

Thus we wish to infer the heights, z, the albedos, p and the camera parame- 
ters, 0, from the images. Bayes theorem gives 

p(z,p,0\{I}) ocp({I}\z,p,9)p(z,p,0) (1) 

We assume that the priors are independent, so that 


p( z. p,0) = p(z)p(p)p(&) 


and currently we use a simple smoothness prior for z and p based on penalizing 
curvature, and a uniform prior on 6. These initial prior assumptions are made for 
the sake of simplicity, and are not fundamental to the approach (see below). The 
likelihood is assumed to result from Gaussian errors between the image /( z, p, 0) 
^nthesizedlrom“the surface model and the observed images giving 


p({J}|z,p,6>) oc exp 



4 P ( z ,/*, 0/))) 2 /(2al) 


( 2 ) 


where the sum is over all pixels, p in all images, If. The surface parameters, z, 
p, are clearly shared between all images. Each image has its own set of camera 
parameters, Of, 


The function 7(z, p,6>) is the process of rendering the surface described by 
{z , p} with the camera location and orientation given by 0. This is clearly non- 
linear, and makes optimization of the posterior distribution in equation 1 dif- 
ficult. To make progress in finding the maximum a-posteriori (MAP) estimate, 
we linearize the image formation process about the current estimate, 

/(z,p,0) = /(uo) + Dx (3) 

where u = {z, p, Q} ) x = u — uo and 

fa/ si df_ 

“ [dz 7 dp' 80 

If we use a Gaussian smoothness prior with covariance matrix £ as described 
above then the linearization converts finding the MAP estimate to the nrinimiza- 
tion of a quadratic form 


L = ^x r Ax - bx 
2 

A = IT 1 + ^DD t 

b= A % 

which is equivalent to the solution of the system of equations 

Ax — b (7) 

Note that for this linearization to proceed, the only restriction on the smooth- 
ness prior is that it can be expressed as a covariance matrix. This makes no 
assumption of spatial uniformity; indeed the prior can easily be made spatially 
adaptive, to allow for the formation of discontinuities in the heights and albedos. 

Consider the structure of this system of equations. The matrix of derivatives 
D is of dimensions 


( 4 ) 

( 5 ) 

( 6 ) 


(no. of pixels) x (no. of heights -b no. of albedos + 

no. of camera parameters) (8) 

or, for the results presented later 

(256 x 256) x (301 x 301 -b 301 x 301 + 6) 

The portion of this matrix that is due to the differentials with respect to z and 
p is very sparse, as typically each mesh vertex is used by a few of the triangles 
that make up the surface, and these triangles project into only a few pixels. 
The portion due to the differentials with respect to the camera parameters is, 
however, dense, as changing any one of the camera parameters typically affects 



the intensities of all the pixels in the image. As a result of this, DD r and hence A 
are very large (around (180, 000 x 180, 000) and dense (around 3 x 10 10 elements). 
It is clearly impractical to perform joint estimation in this manner. Instead we 
estimate alternately the camera parameters and the surface parameters, that is 

given (9, estimate {z, p} 

given {z ,/>}, estimate 0 (9) 

In this way we compute either with a very large, but very sparse matrix when 
estimating z and p, or with a very small, dense matrix when estimating 0. The 
estimates are made by using conjugate gradient to solve equation 7 in an iterative 
manner. At convergence, we update the current estimate, ui = uo -f x, re-render 
to compute new values of 7(z,p, O) and D, and repeat the solution of equation 
7 until a stable solution is reached. This optimization process can be applied in 
a multiresolution framework, to both accelerate and improve convergence. 

This requires an initialization for. either O or {z,p}. We use initial values 
for O from point matching a very small number of points, or from nominal 
camera position and orientations, if they are known (eg from rover or aircraft 
dead-reckoning). In the experiments described later, point matching was used. 

The remainder of the paper is organized as follows: In section 2 and 3 we 
describe the basic rendering algorithm for a Tenderer which efficiently computes 
the images and calculates the derivative values used for the conjugate gradient 
search outlined in section 1. Results and Conclusions are given in sections 4 and 
5 

2 The Fractional Derivative Renderer 

As we have seen, to solve the inverse problem we must be able to simulate 
the forward problem, to compute /(z,p,<9), (“rendering”). Current rendering 
technology uses “image space” computation, where the fundamental unit is the 
pixel. Each pixel is assumed to be illuminated by light from one, and only one, 
triangular facet. This assumption makes for very fast rendering, but results in 
aliasing artefacts. It also makes the rendering process non-differentiable. 

To enable a renderer to also compute derivatives it is necessary that all 
computations are done in “object space”. This implies that the light from a 
surface triangle, as it is projected into a pixel, contributes to the brightness of 
that pixel with a weight proportional to the fraction of the area of the triangle 
■which- projects -into that pixel. -The .total-brightness. of the pixel is thus the sum 
of the contributions from all the triangles whose projections overlaps with the 
pixel ■ 

4 = £/a^a, (io) 

A 

where f A is the fraction of the flux from triangle A that falls into pixel p, given 
by 

tP . ^-polygon 

/a ~ _ 4 ^“ 


( 11 ) 
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Fig. 1. Geometry of tlie triangular facet, illumination direction and viewing direction. 
z s is the vector to the illumination source; z v is the viewing direction. 


where A denotes projected area, and is the total flux from the triangle, and 
^polygon * s 2X63 on plane of the intersection of the projection of 

the triangle and the pixel. In the case of Lambertian reflection, this is given by 

= pE(a s ) cos a v (cos 9) K AQ, (12) 

E(a s )=A{Z s cos a s +Z a ). 

AQ = S/d 2 , 

Here p is an average albedo of the triangular facet. Orientation angles a s and 
a v are defined in figure 1 . E(a s ) is the total radiation flux incident on the 
triangular facet with area A. This flux is modeled as a sum of two terms. The first 
term corresponds to direct radiation with intensity X s from the light source at 
infinity (commonly the sun). The second term corresponds to ambient fight with 
intensity I a . The parameter 6 in equation (12) is the angle between the camera 
axis and the viewing direction (the vector from the surface to the camera); k 
is the lens falloff factor. AQ in (12) is the solid angle subtended by the camera 
which is determined by the area of the lens S and the distance d from the centroid 
of the triangular facet to the camera. If shadows are present on the surface the 
situation is somewhat more complex. In this paper we assume that there are no 
shadows or occlusions present in the images. However the presence of shadows 
and occlusions, wfiflsY makmg~more~ complex the' computation ~ot the- image [-7,- 
8] and its derivatives, should lead to a more precise and robust surface estimate, 
as long-range correlations are incorporated into the estimation. 

The overall complexity of the rendering procedure and derivative calculation 
procedure scales as 

„ , pixel area ,, 

C = # images x# mangles x (13) 

This can be seen from the algorithmic outline of the rendering step: 



loop over images 
loop over surface triangles 
loop over affected pixels 
calculate fractions and derivatives 
calculate light contribution and 
derivatives 

pixelintensity += light * fractions 
end 
end 
end 

Where the fractions are those in equation 11. The corresponding derivatives 
are efficiently calculated as shown in the next section. 


3 Efficient Derivative Computation 


To compute the MAP estimates of {z.p} and © we must compute both the 
image /(z,p,@) and the derivative matrices D z , and D@. 

The derivatives with respect to the albedo values can easily be derived from 
equations 10 and 12. Note that because 0 < p < 1, in practice, we work with 
transformed albedo values, where p -> log(p/(l - p)). 

Denoting by u the component of z or & that we are currently considering, 
the pixel intensity derivatives with respect to u have two components 



d$ A 

du 



(14) 


The first component is due to changes in angle - as the height of a vertex changes, 
the normal to the facet changes, and so the derivative has a component due to the 
change in angle between the normal and the sun direction; as the camera changes 
position, the angle between the normal and the ray to the camera changes. 

Consider first d$&/d&i. We neglect the derivatives with respect to the fall- 
off angle, 6 y as their contribution will be small, and so it is clear from equation 
12 that the derivative with respect to any of the camera orientation angles is 
zero. 

. _ The derivative, with respect.to the camera position parameters is given by 


<9#a 

~d©~ 


OC 


d 

d©i 


cos a v 


= -(Zi - Z v (z v .Zi)) 
V 


(15) 


where v is the vector from the triangle to the camera, v = |v|, <9* are the three 
components of the camera position, z\ are unit vectors in the three coordinate 
directions and z v = v/v (see figure 1). n is the normal to the triangular facet. 


Consider now the derivative with respect to the height of one of the mesh 
vertices, Zj. The flux derivative, d$/dz- u can be computed directly from the 
coordinates of the triangle vertices and the camera position using equation 12. 
For the surface triangle with vertices (Pi 0 , P^ , Pj 2 ) the flux derivative with 
respect to the z component of the vertex Pi 0 equals 

{16) 


where 

g ^ "Ir y (z v cos ots *4“ z$ cos oc.y n cos ol§ cos ca-y ) -f- Xq z v 

and z is a unit normal in the vertical direction. 

For a triangle that projects entirely within a pixel, this completes the deriva- 
tive computation - the second term in equation 14 is the derivative of the frac- 
tional area of the triangle that projects into the pixel. 

3.1 Fractional Area Derivatives 

When the height of a vertex, z, changes, its projection on the image plane, P, 
also moves, by 5P . This gives rise to a change SA& in the area of the projection 
of the triangle, and also the change ^polygon ‘ m ^ polygon area. It follows from 
equation 11 that 


( ^polygon _ fP <9Aa \ dPj Q , , 

dz io i A ^ 3P io /a dP io J dz\ 0 ■ 

where the point displacement derivative dPi Q /dzi Q can be found in [13]. 

However, when the camera parameters change, the positions of the projec- 
tions of all the mesh vertices into the image plane will move. Then the the 
derivative of the fractional area is simply a sum of all three position changes and 
is given by 


d/a _ _1_ Y" ( ^poiygss _ f p dP j /-rg'j 

d&i Aa . fi . I 5Pj /A <9Pj / dOi 1 ' 

The point displacement derivatives are again in [13]. 

Thus, tfre tas^ the derivative of the area fraction given _in equa- 

tion 18 is reduced to the computation of dA&/dP$ and dAp 0 jyg 0n /<9Pj. Note 
that the intersection of a triangle and a pixel for a rectangular pixel boundary 
can, in general, be a polygon with 3 to 7 edges with various possible forms. 
However the algorithm for computing the polygon area derivatives that we have 
developed is general, and does not depend on a particular polygon configuration. 
The main idea of the algorithm can be described as follows. Consider, as an ex- 
ample, the polygon shown in figure 2 which is a part of the projected surface 
triangle with indices i 0 , ii, \ 2 - We are interested in the derivative of the polygon 




Fig. 2 . The intersection of the projection of a triangular surface element (io, ii, 12) onto 
the pixel plane with the pixel boundaries. Bold lines corresponds to the edges of the 
polygon resulting from the intersection. Dashed lines correspond to the new positions 
of the triangle edges when point P; 0 is displaced by SP 


area with respect to the point Pi 0 that connects two edges of the projected tri- 
angle, (P i2 ,Pi 0 ) and (PjojPjJ. These triangular edges contain segments (I, J) 
and (K, L) that are sides of the corresponding polygon. It can be seen from 
figure 2 that when the point Pi 0 is displaced by JPi 0 the change in the polygon 
area is given by the sum of two terms 


^■polygon = + $Ak,L 


These terms are equal to the areas spanned by the two corresponding segments 
taken with appropriate signs. Therefore the polygon area derivative with respect 
to the triangle vertex P io is represented as a sum of the two “segment area” 
derivatives for the two segments adjacent to a given vertex. Using straightforward 
geometrical arguments one can calculate the areas 8A\ y j and 5Ak,l to first order 
in the displacement £Pi 0 . Then the polygon area derivative can be written in 
the following form: 


^polygon _ 1 a - \ ° 1 

<9P io ~ 2 ’ [-10 


(19) ' 


The unit antisymmetric matrix a performs a -7r/2 rotation in the image plane 
and vector W equals 


w = [(1 - R\) - (P i0 - P i3 ) 

+ [(i-j&)-*y (Ph-p*.)- (20) 


! 
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The ratio factors R determine the positions of the intersection points I, J, K 7 L 
on the edges of the triangle (see figure 2). 

Equations 19 and 20 are the central result of the area fraction derivative com- 
putation. It is given for the general case of triangle-pixel intersection where two 
edges of triangle adjacent to the vertex Pi 0 each have two intersection points. 
Note that pairs of intersection points, I, J and K, L are defined in a unique way 
if one considers the triangle edges in counterclockwise order. Therefore equations 
19- 20 can be applied to all possible intersection cases. For example, assume that 
all three triangle vertices axe projected inside the pixel. In this case intersection 
point K has merged with Pi 2 , points L and I have merged with P| Q and J with 
Pi 1 . Then in equation 20 we should put 

R k ^R l ^R 1 =Rj=0. ( 21 ) 

In this case polygon area derivative in equation 19 is reduced to the derivative 
of the full area of the projected triangle 

( 22 ) 

The general rule for computing the ratio factors i?i,j,K,L can be formulated 
as follows: 

— If point Pi 0 lies inside of the pixel one should set in equation 20 ratio factors 
Ri_ ~ 0 and Rj ~ 0. 

— If point Pj 2 lies inside of the pixel then one sets f?K = 0. 

— If Pi 2 lies inside then Rj = 0. 

This describes all possible intersection cases and provides a full description for 
the area fraction derivative (18). 

Further details of the derivative computation, together with full details of 
the point displacement derivatives, can be found in [13]. 

4 Results 

We present here the results of applying our methodology. We will demonstrate 
our contention that the small improvements made by our registration method 
to the camera parameter estimates results in large improvements to the quality 

of the inferred surface: - — - — - 

Figure 3 shows four synthetic images of a region of Duckwater, Nevada. They 
were generated by rendering a synthetic surface. The surface was constructed by 
using the USGS Digital Elevation Model for the heights, and using the scaled 
intensities of a LANDS AT-TM image as surrogate albedos. The size of the sur- 
face is 301 x 301 points. The distance between grid points was taken to be one 
unit, and the heights scaled appropriately. Figure 4 shows a perspective view of 
the surface with expanded vertical scale. Table 1 gives the camera parameters 
that were used to generate the images. 




Fig. 3. Four synthetic images of Duckwater, Nevada 


An initial estimate of the camera parameters was made by using point match- 
ing [12]. We have found that the Harris corner detector [11] typically used to 
select features does not find many reliable features in the types of natural scenery 
we are concerned with here. Table 1 gives the parameters estimated by matching 
five points across the four images. Note that these camera parameter estimates 
appear accurate, with the major error being in the orientation angle (view-up 
vector) . 

Using these estimated camera parameters, a dense surface estimate can be 
made. For space reasons we do not show the surface estimate, instead, in figure 
7 we show the error surface, and a cross section in figure 6. The main points to 
note are that 

1. the small inaccuracies in the camera parameter estimation have resulted in 
an erroneous slope in the surface estimate. 

2. the overall height of the surface Is shifted upwards; but note that the overall 
shift is a small percentage (less than 0.5%) of the distance from the surface 
to the cameras. The overall height is only weakly determined. 

3. the albedo estimates are in general quite good (the RMSE for the albedo 
estimate is 0.022). 

Using the gradient-based, whole image, approach to camera calibration to 
a surface, that we have described above, we then registered the images to the 
surface estimate. Using the new camera parameters, we re-estimated the surface. 




Fig. 4. True surface 


This was iterated three times. On a 1.2GHz Athlon PC, rendering and computing 
the derivative matrix takes less than 2 seconds per image. Convergence of the 
Conjugate Gradient for updating the surface estimate is achieved in around 200 
seconds, and for updating the camera parameters in less than a second. Table 1 
gives the final camera parameters, and figure 5 show's the final surface estimate. 
Again, note that the improvements to the registration parameters are small, but 
figures 8 and 6 show r that these small improvements are crucial. Figure 8 is the 
error surface and figure 6 is a section through the error surfaces. We note the 
following: 


1. the main improvement in the camera parameter estimation is in the orien- 
tation angle, defined by the view'-up vector 

2. the erroneous slope has been corrected 

3. the error in the global height remains 

4. the estimate show's most inaccuracies close to rapid changes in albedo, for 
example the -white -(salt lake) area, to the top right _of the surface, wmere 
albedo and slope effects have not been completely decoupled. 


From these numerical experiments, it is clear that the quality of the surface 
inference is very sensitive to even small changes in the camera parameters. The 
convergence radius of a successful surface reconstruction with respect to the cam- 
era parameters is quite small, and therefore the improvements our registration 
method give, w'hilst appearing to be small, have a large effect on the accuracy 
of the surface estimate. 


5 Conclusions 


In this paper we have described a system that takes a set of images and uses 
them to infer both the camera parameters and a dense surface model It does 
this by iterative linearization of a model of the image formation process, and 
minimization of the error between the whole of the observed and rendered im- 
ages with respect to the camera and surface parameters. We have demonstrated 
the convergence of this system on a set of images rendered from a model of a 
region of Nevada. We have demonstrated the need for extremely accurate camera 
registrations in order to accurately infer a dense surface model, and have shown 
that our registration method achieves this. 

Though the computational cost of our system is high compared to a con- 
ventional 3D reconstruction algorithm, it is still of linear complexity, and he 
system we have described has many advantages. The accurate, dense surface 
reconstruction which also has albedo information can be used for a number of 
scientific applications, for example spectroscopy for remote mineral type deter- 
mination. The scale of the surface model that is estimated is decoupled from the 
pixel scale of the images via the rendering process. This means that the surface 
model scale can be chosen by the user, either on the basis of the use to which 
the surface model will be put, or a scale may be chosen which is best justified by 
the image data. This is important - if we have many low resolution images of a 
region, the scale of the surface model may be super-resolved (where a triangular 
surface element projects onto an area smaller than a pixel on the image plane) . 
If the coverage of the surface by the images is non-uniform, we can specify a 
spatially-varying mesh for the surface, denser in regions where we have more 
images. 

The information about the surface captured by the system is not just the 
MAP surface estimate, but also the accuracy of the estimate, represented by 
the inverse covariance matrix (A in equation 5) . Knowing the inverse covariance 
matrix allows for recursive updates - as new images become available the infor- 
mation they contain can be integrated into the model. In Bayesian terminology, 
the posterior distribution from one set of images (defined by the MAP estimate 
and the inverse covariance matrix) becomes the prior for estimation with new 
images. 

Finally, we are not restricted to only image data. If data from other sensing 
modalities is available (for example, laser altimetry data) then we can add a 
term to the likelihood (equation 2) for this data, take derivatives of a model 
of “how this new sensor makes measurements with “respect To -the -surface model 
parameters, and our surface model estimate will seamlessly integrate the multi- 
modal information. 
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